
CRISP-DM Modeling Heart Disease on Subjects in Cleveland, OH, USA

This dataset is hosted in a Kaggle repository.

Business Understanding

Problem Statement

Predict heart disease in a patient from Cleveland Medical Center, OH, USA. Since this analysis and modeling covers risk to human life, it is important to maintain a high rate of identifying patients who have heart disease (true positives, i.e., high sensitivity) while keeping a realistic rate of false positives (predicting heart disease in a patient when there is none).

Context

This database contains data about the factors related to heart disease. There are 14 attributes from Cleveland Clinic Foundation. The “target” field refers to the presence or high risk of heart disease in the subject.

  • 0 represents no heart disease present
  • 1 implies presence of heart disease

Content

Attribute Information:

  1. age
  2. sex
  3. chest pain type (4 values)
  4. resting blood pressure
  5. serum cholesterol in mg/dl
  6. fasting blood sugar > 120 mg/dl
  7. resting electrocardiographic results (values 0,1,2)
  8. maximum heart rate achieved
  9. exercise induced angina
  10. oldpeak = ST depression induced by exercise relative to rest
  11. the slope of the peak exercise ST segment
  12. number of major vessels (0-3) colored by fluoroscopy
  13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
  14. target: 0 = no heart disease; 1 = heart disease present

Solution Approach

  1. Supervised Classification
  2. Apply four models: Logistic Regression, Decision Tree, Artificial Neural Network, and Gradient Boosting
  3. Since each model predicts a probability for the target variable, use the models to find the probability threshold that delineates no heart disease from presence of heart disease

Data Understanding

Sample rows

##   X age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 1 1  63   1  3      145  233   1       0     150     0     2.3     0  0
## 2 2  37   1  2      130  250   0       1     187     0     3.5     0  0
## 3 3  NA   0  1      130  204   0       0     172     0     1.4     2  0
## 4 4  56   1  1      120  236   0       1     178     0     0.8     2  0
## 5 5  57   0  0      120  354  NA       1     163     1     0.6     2  0
##   thal target
## 1    1      1
## 2    2      1
## 3    2      1
## 4    2      1
## 5    2      1

Statistical summary of dataset

##        X              age             sex               cp        
##  Min.   :  1.0   Min.   :29.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 76.5   1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :152.0   Median :56.00   Median :1.0000   Median :1.0000  
##  Mean   :152.0   Mean   :54.52   Mean   :0.6832   Mean   :0.9601  
##  3rd Qu.:227.5   3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000  
##  Max.   :303.0   Max.   :77.00   Max.   :1.0000   Max.   :3.0000  
##                  NA's   :45                       NA's   :27      
##     trestbps          chol            fbs            restecg      
##  Min.   : 94.0   Min.   :126.0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:120.0   1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :130.0   Median :240.0   Median :0.0000   Median :1.0000  
##  Mean   :131.6   Mean   :246.3   Mean   :0.1429   Mean   :0.5281  
##  3rd Qu.:140.0   3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :200.0   Max.   :564.0   Max.   :1.0000   Max.   :2.0000  
##                                  NA's   :30                       
##     thalach          exang           oldpeak         slope      
##  Min.   : 71.0   Min.   :0.0000   Min.   :0.00   Min.   :0.000  
##  1st Qu.:133.5   1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000  
##  Median :153.0   Median :0.0000   Median :0.80   Median :1.000  
##  Mean   :149.6   Mean   :0.3267   Mean   :1.04   Mean   :1.407  
##  3rd Qu.:166.0   3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000  
##  Max.   :202.0   Max.   :1.0000   Max.   :6.20   Max.   :2.000  
##                                                  NA's   :30     
##        ca              thal           target      
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :0.0000   Median :2.000   Median :1.0000  
##  Mean   :0.7294   Mean   :2.314   Mean   :0.5446  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :4.0000   Max.   :3.000   Max.   :1.0000  
## 

Summary of overall raw dataset

  1. 55% of subjects have heart disease
  2. 68% of subjects are male
  3. 45% of males have heart disease
  4. 75% of females have heart disease
## 
##         0         1 
## 0.4554455 0.5445545
## 
##         0         1 
## 0.3168317 0.6831683
## 
##         0         1 
## 0.5507246 0.4492754
## 
##    0    1 
## 0.25 0.75

Summary of missing data

About 64% of the rows in the dataset have no missing values. Four variables have missing values: roughly 15% of age, 10% of fbs, 10% of slope, and 9% of cp, as shown in the table below.

## 
##  Variables sorted by number of missings: 
##  Variable      Count
##       age 0.14851485
##       fbs 0.09900990
##     slope 0.09900990
##        cp 0.08910891
##         X 0.00000000
##       sex 0.00000000
##  trestbps 0.00000000
##      chol 0.00000000
##   restecg 0.00000000
##   thalach 0.00000000
##     exang 0.00000000
##   oldpeak 0.00000000
##        ca 0.00000000
##      thal 0.00000000
##    target 0.00000000

Data Imputation

Missing data is imputed using predictive mean matching with Mice. Predictive Mean Matching (PMM) is a semi-parametric imputation approach. It is similar to the regression method except that for each missing value, it fills in a value randomly from among the a observed donor values from an observation whose regression-predicted values are closest to the regression-predicted value for the missing value from the simulated regression model (Heitjan and Little 1991; Schenker and Taylor 1996). The PMM method ensures that imputed values are plausible; it might be more appropriate than the regression method (which assumes a joint multivariate normal distribution) if the normality assumption is violated (Horton and Lipsitz 2001, p. 246).
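As a rough illustration of the idea, here is a minimal single-variable PMM sketch in Python with NumPy (the report itself uses the mice package in R; the toy x/y values below are hypothetical):

```python
import numpy as np

def pmm_impute(x, y, k=5, seed=0):
    """Single-variable predictive mean matching: regress y on x using
    complete cases, then fill each missing y with a value drawn from the
    k observed "donor" cases whose regression predictions are closest."""
    rng = np.random.default_rng(seed)
    y = y.astype(float).copy()
    obs = ~np.isnan(y)
    X = np.column_stack([np.ones_like(x), x])          # intercept + predictor
    beta, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)
    pred = X @ beta                                    # predictions for all rows
    for i in np.where(~obs)[0]:
        nearest = np.argsort(np.abs(pred[obs] - pred[i]))[:k]
        y[i] = rng.choice(y[obs][nearest])             # donor is an observed value
    return y

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, np.nan, 8.2, 9.8, np.nan])
imputed = pmm_impute(x, y, k=2)
```

Because each imputed value is copied from an observed donor, the imputations stay within the plausible range of the data, which is the key property described above.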

Missing data after imputation.

Dataset is now complete as shown below.

## 
##  Variables sorted by number of missings: 
##  Variable Count
##         X     0
##       age     0
##       sex     0
##        cp     0
##  trestbps     0
##      chol     0
##       fbs     0
##   restecg     0
##   thalach     0
##     exang     0
##   oldpeak     0
##     slope     0
##        ca     0
##      thal     0
##    target     0

Data Exploration

Correlation Matrix

The correlation matrix visualizes how features are correlated with each other and with the target variable. No pair of features is strongly correlated (all pairwise correlations are below 0.5).


Pairwise Correlations

  1. Gender
  • Heart disease is more prevalent in females than in males
  • 45% of males have heart disease
  • 75% of females have heart disease
## 
##         0         1 
## 0.5507246 0.4492754
## 
##    0    1 
## 0.25 0.75

  2. Age

Rate of heart disease increases with age over 60

  3. Cholesterol

Higher cholesterol increases the rate of heart disease. The data does not indicate whether this is good (HDL), bad (LDL), or total cholesterol.

  4. Fasting Blood Sugar

Among subjects with fasting blood sugar > 120 mg/dl, the majority have heart disease.

  5. Resting ECG results

Subjects with higher resting ECG have a higher prevalence of heart disease


Data Preparation

Visualization of scale and outliers

We use Rosner's test to identify outliers.

Outcomes are:

  1. The data needs scaling
  2. Four columns have outliers: resting BP (trestbps), cholesterol (chol), stress-test ST depression (oldpeak), and defects (thal)

Standardize data and outlier handling/processing

  1. Replaced outliers with 5th and 95th percentile values
  2. Scaled data
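The two preparation steps can be sketched in Python with NumPy, assuming percentile clipping (winsorizing) followed by z-score standardization; the sample cholesterol values are illustrative:

```python
import numpy as np

def winsorize_and_scale(col, lo=5, hi=95):
    """Clamp values outside the 5th/95th percentiles, then standardize
    the column to zero mean and unit variance."""
    low, high = np.percentile(col, [lo, hi])
    clipped = np.clip(col, low, high)          # outliers pulled to the bounds
    return (clipped - clipped.mean()) / clipped.std()

chol = np.array([126.0, 211.0, 240.0, 274.0, 564.0])  # includes an outlier
z = winsorize_and_scale(chol)
```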

Visualize prepared data

  • Data is now scaled
  • Outliers have been replaced/processed

PCA and Reduction of features

Let’s attempt to reduce dimensionality of dataset with Principal Component Analysis (PCA).

We use two methods:

  1. Spectral decomposition, which examines the covariances/correlations between variables (princomp)
  2. Singular value decomposition, which examines the covariances/correlations between individuals (prcomp)

While both methods can easily be performed within R, the singular value decomposition method (i.e., Q-mode) is the preferred analysis for numerical accuracy (R Development Core Team 2011).
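A minimal Python sketch of the SVD route (mirroring what prcomp does: center the data, decompose, and accumulate the explained variance); the random matrix is a stand-in for the prepared dataset:

```python
import numpy as np

def pca_cumulative_variance(X):
    """PCA via singular value decomposition: center the data, take the
    SVD, and convert singular values to the cumulative proportion of
    variance explained by each component."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s**2 / (len(X) - 1)                  # component variances
    return np.cumsum(var) / var.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=100)  # a nearly redundant column
cumvar = pca_cumulative_variance(X)
```

With one nearly redundant column, the first four components capture almost all of the variance, which is how the cumulative-proportion rows in the output below should be read.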

  • Both show that the first 7-12 components give us 80%-90% of the variance, and that no one variable is overbearing.

  • Because so many principal components are needed relative to the number of input features, we do not reduce any dimensions using PCA.

  • PRINCOMP: Notice that the first 7 components represent 85% of the variance

## Importance of components:
##                          Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     1.537633 1.1207524 1.0275455 0.9485534 0.88678665
## Proportion of Variance 0.265996 0.1413152 0.1187878 0.1012262 0.08847241
## Cumulative Proportion  0.265996 0.4073112 0.5260990 0.6273252 0.71579763
##                            Comp.6     Comp.7     Comp.8     Comp.9
## Standard deviation     0.85777369 0.65933913 0.59080520 0.52457814
## Proportion of Variance 0.08277801 0.04890883 0.03926975 0.03095921
## Cumulative Proportion  0.79857564 0.84748447 0.88675423 0.91771344
##                           Comp.10    Comp.11    Comp.12    Comp.13
## Standard deviation     0.44490937 0.42547108 0.38168019 0.32729705
## Proportion of Variance 0.02226961 0.02036618 0.01638962 0.01205185
## Cumulative Proportion  0.93998305 0.96034924 0.97673885 0.98879070
##                          Comp.14
## Standard deviation     0.3156490
## Proportion of Variance 0.0112093
## Cumulative Proportion  1.0000000

PRCOMP: Notice that the first 7 components represent about 85% of the variance

## Importance of components:
##                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.540 1.1226 1.0292 0.9501 0.88825 0.85919 0.66043
## Proportion of Variance 0.266 0.1413 0.1188 0.1012 0.08847 0.08278 0.04891
## Cumulative Proportion  0.266 0.4073 0.5261 0.6273 0.71580 0.79858 0.84748
##                            PC8     PC9    PC10    PC11    PC12    PC13
## Standard deviation     0.59178 0.52545 0.44565 0.42617 0.38231 0.32784
## Proportion of Variance 0.03927 0.03096 0.02227 0.02037 0.01639 0.01205
## Cumulative Proportion  0.88675 0.91771 0.93998 0.96035 0.97674 0.98879
##                           PC14
## Standard deviation     0.31617
## Proportion of Variance 0.01121
## Cumulative Proportion  1.00000

Partition dataset

  1. Training - 75%, 229 subjects
  2. Testing - 25%, 74 subjects
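A simple random 75/25 split can be sketched as follows (the report's exact counts of 229/74 depend on its splitting routine, likely stratified sampling; this sketch is unstratified):

```python
import numpy as np

def partition(n, train_frac=0.75, seed=42):
    """Randomly split the row indices 0..n-1 into train and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                   # shuffle all row indices
    cut = int(round(n * train_frac))
    return idx[:cut], idx[cut:]

train_idx, test_idx = partition(303)           # 303 subjects in the dataset
```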

Modeling

Logistic Regression Model

  1. Overview Logistic Regression is used when the dependent variable(target) is categorical, as in this case.

Types of Logistic Regression:

  • Binary Logistic Regression: the categorical response has only two possible outcomes. Example: spam or not spam.
  • Multinomial Logistic Regression: three or more categories without ordering. Example: predicting which food is preferred (veg, non-veg, vegan).
  • Ordinal Logistic Regression: three or more categories with ordering. Example: a movie rating from 1 to 5.

The p-value is an important evaluation metric: it helps you decide whether there is a relationship between two variables.

The smaller the p-value, the more confident you can be that a relationship between the two variables exists. P-values originate from hypothesis testing in statistics, where there are two hypotheses: H0 (the null hypothesis), that there is no relationship between the two variables, and H1 (the alternative hypothesis), that there is a relationship between them.

If the p-value is less than a small threshold (0.05 is commonly used), you can reject the null hypothesis H0 and conclude that there is a relationship between the two variables.
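To make the mechanics concrete, here is a hedged NumPy sketch of fitting a binary logistic regression by Newton-Raphson and computing the Wald z-statistics from which the coefficient p-values derive (simulated data, not the heart-disease set):

```python
import numpy as np

def logit_fit(X, y, iters=25):
    """Fit binary logistic regression by Newton-Raphson and return the
    coefficients with their Wald z-statistics (coef / std.err.); each
    p-value comes from comparing |z| against a standard normal."""
    X = np.column_stack([np.ones(len(X)), X])       # prepend an intercept
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        H = (X.T * (p * (1.0 - p))) @ X             # observed information
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    H = (X.T * (p * (1.0 - p))) @ X                 # information at the fit
    se = np.sqrt(np.diag(np.linalg.inv(H)))
    return beta, beta / se

# Simulated data with a real effect (true slope = 2).
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-2.0 * x[:, 0]))).astype(float)
beta, z = logit_fit(x, y)
```

A large |z| (and hence a small p-value) for the slope indicates the kind of relationship the hypothesis-test discussion above describes.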

  2. Run the model against training data using the binomial family

Looking at the p-values, the following variables are predictors of the target variable:

  • Gender (sex)
  • Type of Chest Pain (cp)
  • Resting ECG result (restecg)
  • Exercise Induced Angina (exang)
  • Stress Test depression (oldpeak)
  • Number of colored vessels (ca)

  3. Evaluate and refine the model -

    Compute the average prediction for each true outcome: the mean predicted probability is 0.79 for subjects who have heart disease and 0.26 for subjects who do not.

## 
## Call:
## glm(formula = target ~ age + sex + cp + trestbps + chol + fbs + 
##     restecg + thalach + exang + oldpeak + slope + ca, family = binomial, 
##     data = trainData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4422  -0.3588   0.1597   0.4854   2.3002  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.6865     0.8106   0.847 0.397009    
## age          -0.4887     0.2563  -1.907 0.056579 .  
## sex          -1.7608     0.5116  -3.442 0.000578 ***
## cp            0.6577     0.2180   3.018 0.002548 ** 
## trestbps     -0.4288     0.2120  -2.023 0.043058 *  
## chol         -0.4255     0.2445  -1.740 0.081816 .  
## fbs           0.3893     0.6726   0.579 0.562792    
## restecg       0.6065     0.4000   1.516 0.129446    
## thalach       0.2833     0.2574   1.101 0.271035    
## exang        -1.1019     0.4682  -2.353 0.018608 *  
## oldpeak      -0.7465     0.2699  -2.766 0.005682 ** 
## slope         0.6164     0.4415   1.396 0.162670    
## ca           -0.9816     0.2314  -4.242 2.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 313.54  on 227  degrees of freedom
## Residual deviance: 158.16  on 215  degrees of freedom
## AIC: 184.16
## 
## Number of Fisher Scoring iterations: 6
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001613 0.156975 0.637273 0.552632 0.922450 0.999223
##         0         1 
## 0.2624057 0.7854937
  4. Find the threshold probability to delineate between heart disease = 0 and 1
  • The ROC curve helps find the threshold
    • We want high true positives (high sensitivity) for diagnosing heart disease, and we accept a higher false-positive rate in return
    • Therefore we choose the point near (0.2 false-positive rate, 0.9 sensitivity), where 90% of subjects with heart disease are diagnosed correctly
    • At this point the threshold is 0.5, which implies that a predicted probability above 0.5 should be classified as heart disease
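The threshold search can be sketched as: scan candidate thresholds from high to low and keep the highest one whose sensitivity meets the target (the toy labels and probabilities below are illustrative):

```python
import numpy as np

def pick_threshold(y_true, prob, min_sensitivity=0.90):
    """Return the highest probability threshold that still identifies
    at least `min_sensitivity` of the true positive cases."""
    for t in sorted(set(prob), reverse=True):
        pred = prob >= t
        sens = pred[y_true == 1].mean()    # TP rate among diseased subjects
        if sens >= min_sensitivity:
            return t
    return min(prob)

y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
p = np.array([0.1, 0.3, 0.4, 0.6, 0.5, 0.7, 0.8, 0.9, 0.95])
t = pick_threshold(y, p, min_sensitivity=0.8)   # 4 of 5 positives score >= 0.7
```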
  5. Evaluate the model by predicting on the test dataset and generating the ROC curve
  • With 0.5 as the threshold, prediction on the test dataset yields an accuracy of 89%

The accuracy of the test depends on how well the test separates the group being tested into those with and without the disease in question. Accuracy is measured by the area under the ROC curve. An area of 1 represents a perfect test; an area of .5 represents a worthless test. A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system:

  • 0.90-1.00 = excellent (A)
  • 0.80-0.90 = good (B)
  • 0.70-0.80 = fair (C)
  • 0.60-0.70 = poor (D)
  • 0.50-0.60 = fail (F)
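The area under the ROC curve can also be computed directly from its probabilistic interpretation (the Mann-Whitney form): the probability that a randomly chosen diseased case scores higher than a randomly chosen healthy one. A small NumPy sketch with toy scores:

```python
import numpy as np

def auc(y_true, prob):
    """AUC via the Mann-Whitney formulation: the fraction of
    (diseased, healthy) pairs ranked correctly, counting ties as half."""
    pos = prob[y_true == 1]
    neg = prob[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0, 0, 1, 1])
p = np.array([0.2, 0.6, 0.4, 0.9])
score = auc(y, p)   # 3 of the 4 pos/neg pairs are ranked correctly
```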

##    
##     FALSE TRUE
##   0    24   12
##   1     8   31

## 
## Call:
## roc.default(response = testData$target, predictor = lrtest.prob,     plot = TRUE, col = "blue")
## 
## Data: lrtest.prob in 35 controls (testData$target 0) < 39 cases (testData$target 1).
## Area under the curve: 0.896
  6. Plot the logistic regression model against the predictor variables
    • Gender (sex)
    • Type of Chest Pain (cp)
    • Resting ECG result (restecg)
    • Exercise Induced Angina (exang)
    • Stress Test depression (oldpeak)
    • Number of colored vessels (ca)
  1. Gender versus Target variable
  • Heart disease is more prevalent in females than in males
  2. Type of Chest Pain versus Target variable
  • Heart disease is predicted by chest pain types 1, 2, and 3
  3. Resting ECG result versus Target variable

A higher resting ECG result predicts heart disease

  4. Exercise Induced Angina versus Target variable

Absence of exercise-induced angina predicts heart disease (the exang coefficient in the fitted model is negative)

  5. Stress Test Depression versus Target variable

Lower ST depression during the stress test predicts heart disease

  6. Number of Colored Vessels versus Target variable
  • Heart disease is predicted by a lower number of colored vessels

Decision Tree algorithm

  1. Overview A decision tree is a flowchart-like tree structure where each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, splitting recursively in a process called recursive partitioning. This flowchart-like structure aids decision making and mimics human-level thinking, which is why decision trees are easy to understand and interpret.

  2. Run the model against training data, predict target on test data

  3. Evaluate the model -

The ROC curve for this model is shown below; the area under the curve is 0.72

## 
## Call:
## roc.default(response = testData$target, predictor = factor(dt.predict,     ordered = TRUE), plot = TRUE, col = "blue")
## 
## Data: factor(dt.predict, ordered = TRUE) in 36 controls (testData$target 0) < 39 cases (testData$target 1).
## Area under the curve: 0.7179
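The decision-tree workflow above (fit on training data, predict, inspect the tree) can be sketched with scikit-learn; the tiny feature matrix below is a hypothetical stand-in for the heart-disease data:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical (age, sex, cp) rows and target labels, not the real dataset.
X = [[63, 1, 3], [37, 1, 2], [56, 0, 1], [57, 0, 0], [44, 1, 1], [52, 0, 2]]
y = [1, 1, 0, 0, 1, 0]

# Fit a shallow tree: each internal node is a feature test, each leaf an outcome.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "sex", "cp"]))  # the flowchart
pred = tree.predict([[60, 1, 3]])
```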

Artificial Neural Networks (ANN) Model

  1. Overview An artificial neural network learns a non-linear mapping from the input features to the target through layers of interconnected weighted units; the weights are adjusted during training, typically by backpropagation.

  2. Run the model against training data, predict target on test data

  3. Evaluate the model -

The ROC curve shows an area under the curve of 0.82 for this model

## 
## Call:
## roc.default(response = testData$target, predictor = factor(annResult,     ordered = TRUE), plot = TRUE, col = "blue")
## 
## Data: factor(annResult, ordered = TRUE) in 35 controls (testData$target 0) < 39 cases (testData$target 1).
## Area under the curve: 0.8216

Gradient Boosting Model

  1. Overview This model uses an ensemble of weak learners, such as shallow decision trees, that are combined to form a strong gradient-boosted model.

Gradient Boosting Machine (for Regression and Classification) is a forward learning ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.

  2. Run the model against training data, predict target on test data

  3. Evaluate the model -

The ROC curve shows an area under the curve of 0.86 for this model

## 
## Call:
## roc.default(response = testData$target, predictor = gbm.test,     plot = TRUE, col = "red")
## 
## Data: gbm.test in 35 controls (testData$target 0) < 39 cases (testData$target 1).
## Area under the curve: 0.8557
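The gradient-boosting workflow can be sketched in Python; this uses scikit-learn's GradientBoostingClassifier rather than H2O's GBM, on simulated stand-in data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical two-feature data standing in for the prepared dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Each of the 100 shallow trees corrects the residual errors of the
# ensemble built so far -- the "weak learners combined" idea above.
gbm = GradientBoostingClassifier(n_estimators=100, max_depth=2,
                                 learning_rate=0.1, random_state=0).fit(X, y)
prob = gbm.predict_proba(X)[:, 1]   # probabilities for the ROC analysis
```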